fix: explicitly delete forward_data_store to prevent GPU memory leak#1638

Closed
lilei199908 wants to merge 1 commit into main from
fix/megatron-memory-leak-forward-data-store

Conversation

@lilei199908
Collaborator

Summary

Fixes a GPU memory leak in forward_only() in slime/backends/megatron_utils/model.py.

Problem

forward_data_store is a list of dicts accumulated across all microbatches during the forward pass. On non-last pipeline stages, mpu.is_pipeline_last_stage() is False, so none of the data in forward_data_store is extracted into rollout_data. As a result, any GPU tensors held inside forward_data_store remain live until the Python garbage collector reclaims them, which may be delayed in long-running training loops, causing unnecessary GPU memory pressure.

Fix

Add an explicit del forward_data_store after its contents have been fully consumed (after the rollout_data population block), so GPU tensor references are dropped as early as possible.
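The placement described above can be sketched as follows. This is an illustrative stand-in only: the real `forward_only()` lives in `slime/backends/megatron_utils/model.py`, and the surrounding logic here is simplified, with hypothetical helper structure.

```python
# Sketch of the fix's placement; names mirror the PR description, but the
# microbatch loop and rollout extraction are simplified stand-ins.
def forward_only_sketch(microbatch_outputs, is_pipeline_last_stage):
    # forward_data_store accumulates one dict per microbatch during the
    # forward pass (in the real code these dicts hold GPU tensors).
    forward_data_store = [{"output": out} for out in microbatch_outputs]

    rollout_data = {}
    if is_pipeline_last_stage:
        # Only the last pipeline stage extracts the stored data.
        rollout_data["output"] = [d["output"] for d in forward_data_store]

    # The fix: drop the reference once the store is fully consumed, so any
    # tensors it holds can be freed before the rest of the function runs.
    del forward_data_store
    return rollout_data
```

On non-last stages the `if` branch is skipped entirely, so the `del` is the only point at which the stored tensors' references are released before the function returns.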

Impact

  • Non-last pipeline stages: GPU memory held by forward_data_store is freed immediately after use, reducing peak memory consumption during rollout.
  • Last pipeline stage: Minor benefit — the dict/list wrapper overhead is freed, though the tensor data itself is still referenced by rollout_data.

Notes

This is a targeted, minimal fix. A complementary improvement would be to call torch.cuda.empty_cache() after deletion if aggressive memory reclamation is needed, but that is left as a separate concern.
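The timing concern behind this fix can be demonstrated in a minimal, CPython-only sketch (all names hypothetical, no GPU involved): without the `del`, an object held by a local stays alive until the variable goes out of scope, even after its data is no longer needed.

```python
import weakref

class FakeTensor:
    """Stand-in for a GPU tensor; only its lifetime matters here."""

def holds_until_scope_exit():
    store = [FakeTensor()]
    probe = weakref.ref(store[0])
    # ... imagine a long tail of work here while `store` stays in scope ...
    return probe() is not None  # True: object still alive during the tail

def freed_by_explicit_del():
    store = [FakeTensor()]
    probe = weakref.ref(store[0])
    del store  # CPython's refcount drops to zero; the object is freed now
    return probe() is not None  # False: object already reclaimed
```

Note this relies on CPython's reference counting; objects caught in reference cycles would additionally need a garbage-collection pass, which is where `torch.cuda.empty_cache()` as a follow-up could matter for cached GPU allocations.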

Copilot AI review requested due to automatic review settings February 27, 2026 05:04
@lilei199908 lilei199908 added the bug Something isn't working label Feb 27, 2026

Copilot AI left a comment


Pull request overview

This PR fixes a GPU memory leak in the forward_only() function by explicitly deleting forward_data_store after its contents have been consumed. The issue occurs when GPU tensors accumulated in forward_data_store during forward passes are not immediately released on non-last pipeline stages, causing unnecessary memory pressure in long-running training loops.

Changes:

  • Adds an explicit del forward_data_store statement after rollout data population to immediately free GPU memory references
  • Includes a comment explaining the purpose of the deletion


values = origin_values
rollout_data[f"{store_prefix}{key}"] = values

# 显式释放 forward_data_store 以避免显存泄漏

Copilot AI Feb 27, 2026


The comment is written in Chinese while all other comments in this file are in English. For consistency with the rest of the codebase, this comment should be in English. Consider changing it to: "Explicitly delete forward_data_store to prevent GPU memory leak"

Suggested change:
- # 显式释放 forward_data_store 以避免显存泄漏
+ # Explicitly delete forward_data_store to prevent GPU memory leak

On non-last pipeline stages, forward_data_store accumulates GPU tensors
from microbatch outputs that are never transferred to rollout_data. These
tensors were held in memory until the local variable went out of scope,
which in long-running training loops could delay GPU memory reclamation.

Explicitly delete forward_data_store after its data has been fully
consumed to release references to these tensors as early as possible.
@lilei199908 lilei199908 force-pushed the fix/megatron-memory-leak-forward-data-store branch from e19c443 to 718c3d4 on February 27, 2026 05:13

Labels

bug Something isn't working


2 participants